"On my honor, as a student, I have neither given nor received unauthorized aid on this academic work."
!pip install plotly
import pandas as pd
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
%matplotlib inline
# import the scatter_matrix functionality
from pandas.plotting import scatter_matrix
#http://pandas.pydata.org/pandas-docs/stable/visualization.html
#http://www.analyticsvidhya.com/blog/2014/08/baby-steps-python-performing-exploratory-analysis-python/
#import python packages (these are the most popular ones)
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.graph_objects as go
import plotly.express as px
%matplotlib inline
print(sns.__version__) #check seaborn version
df = pd.read_csv("movie_metadata.csv")
The goal of this project is to predict movie success. Search Google using key words such as "predicting movie success" and understand the background of this prediction problem. Based on this research, write a summary of the business problem you're trying to solve. Use bulleted lists and/or numbers in markdown cells. Answer the following questions as well:
• What are the project’s goals?
- The system predicts a movie's success, for past as well as upcoming films, so that business decisions about a movie can be made with greater certainty and less risk.
• If you’re hired as a data/business analyst to predict how well a movie will perform in theaters, what kind of data would you collect?
- If I am hired as a data/business analyst to predict how well a movie will perform in theaters, I would collect the movie title, IMDb rating, plot description, budget, box office gross, opening weekend gross, the number of Academy Awards that each movie's actors and directors had won prior to that movie, and the number of Best Picture films those actors and directors had been involved in, also prior to that movie.
• What variables are highly correlated to imdb score? In this project, you will use imdb_score to measure a movie’s success.
- The correlation analysis below identifies these empirically. Note that each IMDb rating comes with both a numeric rating and the number of votes cast.
df.corr(numeric_only=True)["imdb_score"]  # numeric_only=True is required on recent pandas to skip text columns
According to the correlation analysis above, the number of critic reviews and the number of voted users have a high positive correlation with IMDb score. On the other hand, the number of user reviews and movie duration have weaker correlations with IMDb score than the previous two.
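To rank these at a glance, the per-column correlations can be sorted by absolute value. A minimal sketch on a tiny illustrative frame (the real notebook would run the same chain on `df`):

```python
import pandas as pd

# toy stand-in for df; the real notebook would use the loaded movie data
toy = pd.DataFrame({
    "imdb_score": [8.1, 6.5, 7.2, 5.0, 9.0, 4.2],
    "num_critic_for_reviews": [700, 300, 450, 120, 820, 80],
    "num_voted_users": [900000, 200000, 400000, 50000, 1100000, 30000],
    "duration": [142, 98, 120, 90, 175, 85],
})

# rank every numeric column by |correlation| with imdb_score
corrs = (toy.corr(numeric_only=True)["imdb_score"]
            .drop("imdb_score")
            .abs()
            .sort_values(ascending=False))
print(corrs)
```

Sorting by the absolute value keeps strong negative correlates near the top as well, which the plain listing above does not.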
The dataset contains a large number of variables with different types (e.g., numerical, categorical). Provide a brief summary of data understanding. Specifically, you need to:
• Describe data
• Identify data quality issues
• Identify data types
• Identify value counts of a selective list of columns considered to be important to predict a movie's success (imdb_score)
len(df)
# Describe data
df.describe()
# Identify data quality issues
df.isnull().sum()
# Identify data types
df.info()
There are many missing values within the data set, so we drop all rows with null values. Correlation analysis only uses numeric columns, not object (text) columns, and some of the object columns are not significant predictors anyway.
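Since `dropna()` discards a row if *any* column is missing, it is worth counting how many rows survive before committing to it. A sketch on toy data:

```python
import pandas as pd
import numpy as np

# toy frame with missing values standing in for the real dataset
toy = pd.DataFrame({
    "gross":  [1.0, np.nan, 3.0, 4.0, np.nan],
    "budget": [2.0, 2.5, np.nan, 4.0, 5.0],
    "title":  ["a", "b", "c", "d", "e"],
})

rows_before = len(toy)
rows_after = len(toy.dropna())
print(f"dropna keeps {rows_after} of {rows_before} rows")
```

If the loss is too large, `dropna(subset=[...])` or `fillna` on selected columns would be gentler alternatives.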
df['title_year'].value_counts().head()
df['director_name'].value_counts().head()
df['duration'].value_counts().head()
df['num_voted_users'].value_counts().head()
df['budget'].value_counts().head()
df['content_rating'].value_counts().head()
Real-world datasets need to be pre-processed (e.g., cleaning, transforming) prior to formal analysis. Perform all necessary data cleaning and transformation activities. If necessary, you need to create new variables from existing variables. See an example.
df1 = df[['movie_title', 'director_name', 'num_critic_for_reviews', 'duration', 'actor_1_facebook_likes', 'gross', 'budget', 'genres', 'actor_1_name',
          'num_voted_users', 'cast_total_facebook_likes',
          'movie_imdb_link', 'num_user_for_reviews', 'language',
          'content_rating', 'title_year', 'actor_2_facebook_likes',
          'imdb_score', 'movie_facebook_likes']]
df1.head(2)
df1 = df.dropna()
len(df1)
df1['movie_title'].value_counts().head(5)
df1 = df1.drop_duplicates()
len(df1)
df1.corr(numeric_only=True)
df1.columns
df1.info()
df1 = df1.drop('color', axis=1)
After cleaning, 3,723 rows of data remain to analyze.
df1.info()
Potentially, you can answer a lot of interesting questions using business intelligence techniques we’ve learned. The focus should be on what variables are good predictors for a movie’s success. You must use a variety of data visualization and business intelligence techniques. This is the most important component of this project. If this section is “too thin”, your project will receive a very low grade.
df.groupby(['imdb_score']).mean(numeric_only=True).T
df1['profit'] = df1['gross'] - df1['budget']
df1.groupby('title_year').count()['imdb_score'].plot(kind="barh", figsize=(14,9));
More movies have been made over the past two decades, including a higher volume of both higher-scoring and lower-scoring films. Title year by itself is therefore not a significant predictor of imdb_score.
dfprof = df1.sort_values(by=['profit'], ascending=False).head(50)
px.scatter(dfprof, x="budget", y="profit", title="How profit related to budget", trendline='ols', color="imdb_score", size="imdb_score", hover_data=['movie_title', 'director_name'])
dfdura = df1.sort_values(by=['imdb_score'], ascending=False).head(50)
px.scatter(dfdura, x="duration", y="imdb_score", title="How duration related to imdb_score", trendline='ols', color="imdb_score", size="imdb_score", hover_data=['movie_title', 'director_name'])
dfgross = df1.sort_values(by=['imdb_score'], ascending=False).head(50)
px.scatter(dfgross, x="gross", y="imdb_score", title="How gross related to imdb_score", trendline='ols', color="imdb_score", size="imdb_score", hover_data=['movie_title', 'director_name'])
The three graphs above show the correlations between budget and profit, duration and imdb_score, and gross and imdb_score. In conclusion, imdb_score is not strongly affected by duration or by money invested.
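The visual impression can be backed with a correlation coefficient and its p-value via `scipy.stats`, which is already imported above. A sketch with made-up numbers (the real call would pass `df1['duration']` and `df1['imdb_score']`):

```python
import numpy as np
from scipy import stats

# illustrative arrays standing in for the duration and imdb_score columns
duration = np.array([90, 100, 110, 120, 150, 95, 130, 105])
score = np.array([6.1, 7.0, 5.5, 6.8, 7.2, 5.9, 6.5, 7.4])

# Pearson correlation plus a p-value for the "no correlation" null hypothesis
r, p = stats.pearsonr(duration, score)
print(f"Pearson r = {r:.3f}, p-value = {p:.3f}")
```

A small |r| with a large p-value would support the conclusion that duration has little effect on score.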
dfvote = df1.sort_values(by=['num_voted_users'], ascending=False).head(50)
px.scatter(dfvote, x="num_voted_users", y="imdb_score", title="How imdb_score related to num_voted_users", trendline='ols', color="imdb_score", size="imdb_score", hover_data=['movie_title'], labels={'num_voted_users':'User Votes'})
dirsc = df1.groupby('director_name').mean(numeric_only=True).sort_values(by='imdb_score', ascending=False).head(50)
dirprof = df1.groupby('director_name').mean(numeric_only=True).sort_values(by='profit', ascending=False).head(50)
dirsc = dirsc.reset_index()
dirprof = dirprof.reset_index()
px.bar(dirsc, x='director_name', y='imdb_score', color='imdb_score', title="Directors' scores", height=800)
px.bar(dirsc, x='director_name', y='budget', color='imdb_score', title="Movies' budget", height=800)
px.bar(dirprof, x='director_name', y='profit', color='imdb_score', title="Directors' profits", height=800)
sns.lmplot(x='imdb_score', y='title_year', data=df, fit_reg=True);
sns.lmplot(x='imdb_score', y='num_critic_for_reviews', data=df, fit_reg=True);
sns.lmplot(x='imdb_score', y='actor_3_facebook_likes', data=df, fit_reg=True);
sns.lmplot(x='imdb_score', y='cast_total_facebook_likes', data=df, fit_reg=True);
sns.lmplot(x='cast_total_facebook_likes', y='imdb_score', data=df, fit_reg=True);
sns.lmplot(x="budget", y="duration", data=df, x_jitter=.15)
df.groupby('country').hist('imdb_score');
sns.catplot("language", "num_critic_for_reviews", "imdb_score", data=df)
plt.figure(figsize=(14,7))
plt.title("Average IMDB Score by Content Rating")
sns.barplot(x='content_rating', y='imdb_score', data=df)
plt.xlabel('Content Rating')
plt.ylabel("IMDB Score");
This extends the previous section (business intelligence). • Perform correlation analysis and discuss the results. Again, what variables are correlated to imdb_score? How are some key variables correlated to each other?
df1.corr(numeric_only=True)
plt.figure(figsize=(8, 8))
sns.heatmap(df1.corr(numeric_only=True));
In conclusion, the director, a movie's duration, and the number of critics reviewing a movie are critical to imdb_score. These correlations are useful for predicting whether a movie will be successful.
At the end, this is what your client is interested in. Develop useful insights from your analysis. Write a summary using bulleted lists and/or numbers in markdown cells. If this section is "too thin", your project will receive a low grade.
In this project, we developed a mathematical model to predict the success or failure of upcoming movies based on several attributes. The terminal goal of this project is to find the values that help determine a movie's box office success; essentially, movie corporations need to know how to make more money. The dataset contains categorical and numerical information such as IMDb score, director, gross, budget, and so forth.
The most crucial factors for prediction, in my view, are:
1. Movie release year: recent movies score better. If a movie is released on a weekend, it gets a higher weight because the chances of success are greater; if another highly successful movie is released at the same time, the release timing gets a lower weight because competition reduces the chances of success.
2. Movie duration: longer movies score better.
3. Profit: high-profit movies score better.
4. Budget: high-budget movies score better; if a movie's budget is below 5 million, the budget gets a lower weight.
5. director_name: famous directors' movies score better.
6. Number of critics reviewing a movie: more critic reviews indicate better movies.
7. Number of users reviewing a movie: more user reviews indicate better movies.
8. Number of users liking a movie: more user likes indicate better movies.
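The weighting scheme described above can be sketched as a toy scoring function. The thresholds and weights here are illustrative assumptions, not fitted values:

```python
# a hypothetical, hand-tuned scoring sketch combining the factors above;
# every weight and threshold is an illustrative assumption
def naive_success_score(duration, profit, budget, num_critic_reviews):
    score = 0.0
    score += 1.0 if duration > 120 else 0.0        # longer movies score better
    score += 2.0 if profit > 0 else 0.0            # profitable movies score better
    score += 0.5 if budget >= 5_000_000 else 0.0   # sub-5M budgets get no weight
    score += 1.5 if num_critic_reviews > 200 else 0.0  # heavily reviewed movies
    return score

print(naive_success_score(duration=150, profit=40_000_000,
                          budget=60_000_000, num_critic_reviews=450))
```

A fitted model (like the regressions below) would learn these weights from data instead of guessing them.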
To conclude, the factors above are important for a company predicting a movie's success. This project also shows the power of predictive and prescriptive data analytics in information systems to aid movie business decisions. The model can also help forecast the reception of a new movie, so users can more easily decide whether to book tickets in advance.
Big data forecasting is the core application of big data; it extends traditional forecasting into "current measurement." Its advantage is that it transforms a very difficult forecasting problem into a relatively simple description problem, something traditional small data sets cannot match. From a forecasting perspective, the results are not only simple, objective conclusions about real business, but can also inform business management decisions and be used to plan for greater consumer demand.
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
#regression packages
import sklearn.linear_model as lm
from sklearn.metrics import mean_squared_error
from sklearn.metrics import explained_variance_score
# model validation
from sklearn.model_selection import train_test_split
#lasso regression
from sklearn import linear_model
#f_regression (feature selection)
from sklearn.feature_selection import f_regression
from sklearn.feature_selection import SelectKBest
# recursive feature selection (feature selection)
from sklearn.feature_selection import RFE
import statsmodels.api as sm
from statsmodels.formula.api import ols
# install yellowbrick
!pip install yellowbrick
movievalues = pd.read_csv("movie_metadata.csv")
movievalues.head()
len(movievalues)
movievalues1 = movievalues[['movie_title', 'director_name', 'num_critic_for_reviews', 'duration', 'actor_1_facebook_likes', 'gross', 'budget', 'genres', 'actor_1_name',
                            'num_voted_users', 'cast_total_facebook_likes',
                            'movie_imdb_link', 'num_user_for_reviews', 'language',
                            'content_rating', 'title_year', 'actor_2_facebook_likes',
                            'imdb_score', 'movie_facebook_likes']]
movievalues1 = movievalues.dropna()
len(movievalues1)
# Identify data quality issues
movievalues1.isnull().sum()
# Identify data types
movievalues1.info()
sns.catplot(x='imdb_score', y='gross', data=movievalues1)
# correlation
movievalues1.corr(numeric_only=True)
movievalues1['content_rating'].unique()
movievalues1 = pd.get_dummies(movievalues1, columns=["content_rating"])
movievalues1.head(2)
#assigning columns to X and Y variables
y = movievalues1['imdb_score']
X = movievalues1.drop(['color','imdb_score','director_name','actor_2_name','actor_1_name','movie_title','genres','actor_3_name','plot_keywords','country','language', 'movie_imdb_link','title_year','aspect_ratio'], axis =1)
Dropped all categorical (text) variables before fitting the regression.
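Instead of listing every text column to drop by hand, `select_dtypes` can keep only the numeric columns. A sketch on a toy frame:

```python
import pandas as pd
import numpy as np

# toy stand-in; select_dtypes avoids hand-maintaining the drop list
toy = pd.DataFrame({
    "imdb_score": [7.0, 6.1],
    "budget": [1e6, 2e6],
    "movie_title": ["A", "B"],
    "director_name": ["X", "Y"],
})

y = toy["imdb_score"]
# keep numeric columns only, then remove the target itself
X = toy.select_dtypes(include=np.number).drop(columns=["imdb_score"])
print(list(X.columns))
```

This is also robust to new text columns appearing in a future version of the dataset.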
X.columns
model1 = lm.LinearRegression()
model1.fit(X, y)
model1_y = model1.predict(X)
print("mean square error: ", mean_squared_error(y, model1_y))
print("variance or r-squared: ", explained_variance_score(y, model1_y))
pd.DataFrame(list(zip(X.columns, np.transpose(model1.coef_))))
coef = ["%.3f" % i for i in model1.coef_]
xcolumns = [ i for i in X.columns ]
list(zip(xcolumns, coef))
Multiple Regression
R Score: .363
Most Important Feature: content_rating_Approved
#create two datasets from the original
# train data
# test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
print (len(movievalues1), len(X),len(y))
print (len(X_train), len(y_train), len(X_test),len(y_test))
model1 = lm.LinearRegression()
model1.fit(X_train, y_train)
#built
pred_y = model1.predict(X_test)
print("mean square error: ", mean_squared_error(y_test, pred_y))
print("variance or r-squared: ", explained_variance_score(y_test, pred_y))
Multiple Regression, scikit-learn (train/test split)
R Score: .336
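A single 70/30 split can be noisy; k-fold cross-validation averages several splits for a steadier R² estimate. A sketch on synthetic data (the real call would pass the movie `X` and `y`):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# synthetic regression data standing in for the movie features
rng = np.random.RandomState(0)
X = rng.rand(100, 3)
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=100)

# 5-fold cross-validated R^2: five train/test splits, averaged
scores = cross_val_score(LinearRegression(), X, y, scoring="r2", cv=5)
print(scores.mean(), scores.std())
```

Reporting mean ± std across folds is more defensible than a single 0.336 figure from one split.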
# Fit the model below
model1 = lm.Lasso(alpha=0.1) #higher alpha (penalty parameter), fewer predictors
model1.fit(X, y)
model1_y = model1.predict(X)
model1
print('Coefficients: ', model1.coef_)
print("y-intercept ", model1.intercept_)
pd.DataFrame(list(zip(X.columns, np.transpose(model1.coef_))))
coef = ["%.3f" % i for i in model1.coef_]
xcolumns = [ i for i in X.columns ]
list(zip(xcolumns, coef))
print("mean square error: ", mean_squared_error(y, model1_y))
print("variance or r-squared: ", explained_variance_score(y, model1_y))
Lasso Regression
R Score: .363
Most Important Feature: number of critics reviews
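Rather than hard-coding `alpha=0.1`, `LassoCV` can choose the penalty by cross-validation. A sketch on synthetic data where only the first two features matter:

```python
import numpy as np
from sklearn.linear_model import LassoCV

# synthetic data: features 0 and 1 carry the signal, 2 and 3 are noise
rng = np.random.RandomState(0)
X = rng.rand(100, 4)
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=100)

# LassoCV sweeps a grid of alphas and picks the one with the best CV score
model = LassoCV(cv=5, random_state=0).fit(X, y)
print("chosen alpha:", model.alpha_)
print("coefficients:", model.coef_)
```

The noise features' coefficients shrink toward zero, which is how Lasso performs implicit feature selection.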
model2 = linear_model.Ridge() #higher alpha (penalty parameter) shrinks coefficients, but unlike Lasso it rarely zeroes them out
model2.fit(X, y)
model2_y = model2.predict(X)
print("mean square error: ", mean_squared_error(y, model2_y))
print("variance or r-squared: ", explained_variance_score(y, model2_y))
Ridge Regression
R Score: .363
2 variables
X_new = SelectKBest(f_regression, k=2).fit_transform(X, y)
selector = SelectKBest(f_regression, k=2).fit(X, y)
idxs_selected = selector.get_support(indices=True)
print(idxs_selected)
X.head()
model2 = lm.LinearRegression()
model2.fit(X_new, y)
model2_y = model2.predict(X_new)
print("mean square error: ", mean_squared_error(y, model2_y))
print("variance or r-squared: ", explained_variance_score(y, model2_y))
Multiple Regression Two Variables
R Score: .279
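Rather than trying k=2 and k=3 by hand, the whole range of k can be swept and scored with cross-validation. A sketch on synthetic data:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# synthetic data with 3 informative features out of 6
X, y = make_regression(n_samples=120, n_features=6, n_informative=3,
                       noise=5.0, random_state=0)

# score a k-feature model for every k
results = {}
for k in range(1, 7):
    Xk = SelectKBest(f_regression, k=k).fit_transform(X, y)
    results[k] = cross_val_score(LinearRegression(), Xk, y,
                                 scoring="r2", cv=5).mean()
print(results)
```

The best k is where the cross-validated R² plateaus; adding features past that point just adds noise.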
3 variables
X_new = SelectKBest(f_regression, k=3).fit_transform(X, y)
model3 = lm.LinearRegression()
model3.fit(X_new, y)
model3_y = model3.predict(X_new)
print("mean square error: ", mean_squared_error(y, model3_y))
print("variance or r-squared: ", explained_variance_score(y, model3_y))
Multiple Regression Three Variables
R Score: .283
from sklearn.ensemble import RandomForestRegressor
from sklearn.datasets import make_regression
regr = RandomForestRegressor(n_estimators=100, random_state=0)
regr.fit(X, y)
regr_predicted = regr.predict(X)
print("mean square error: ", mean_squared_error(y, regr_predicted))
print("variance or r-squared: ", explained_variance_score(y, regr_predicted))
sorted(zip(regr.feature_importances_, X.columns))
Random Forest Regressor
R Score: .939
Most Important Features: number of voted users, duration, budget
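Note that the 0.939 above is computed on the same rows the forest was trained on, so it is inflated by overfitting; scoring a held-out split gives a more honest figure. A sketch on synthetic data:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import explained_variance_score
from sklearn.model_selection import train_test_split

# synthetic regression data standing in for the movie features
X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=42)

regr = RandomForestRegressor(n_estimators=100, random_state=0)
regr.fit(X_train, y_train)

# train score is near-perfect; the test score is the honest one
train_r2 = explained_variance_score(y_train, regr.predict(X_train))
test_r2 = explained_variance_score(y_test, regr.predict(X_test))
print(f"train: {train_r2:.3f}  test: {test_r2:.3f}")
```

The gap between the two numbers is a direct measure of how much the in-sample 0.939 overstates real predictive power.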
y = movievalues1['imdb_score']
X = movievalues1[['num_voted_users','duration','budget','num_user_for_reviews','gross']]
regr = RandomForestRegressor(n_estimators=100, random_state=0)
regr.fit(X, y)
regr_predicted = regr.predict(X)
print("mean square error: ", mean_squared_error(y, regr_predicted))
print("variance or r-squared: ", explained_variance_score(y, regr_predicted))
sorted(zip(regr.feature_importances_, X.columns))
Random Forest, 5 Variables
R Score: .9328
Most Important Features: number of voted users, duration, budget
# import packages
#import decisiontreeclassifier
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier
#import logisticregression classifier
from sklearn.linear_model import LogisticRegression
import statsmodels.api as sm
#import knn classifier
from sklearn.neighbors import KNeighborsClassifier
#for validating your classification model
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn import metrics
from sklearn.metrics import roc_curve, auc
from sklearn.metrics import roc_auc_score
# feature selection
from sklearn.feature_selection import chi2
# scikitplot for confusion matrix
import scikitplot as skplt
# Create a function to categorize IMDB scores
# The four categories are: <4, 4~6, 6~8 and 8~10, which represents bad, OK, good and excellent respectively
def func(x):
    if 8 <= x <= 10: return 'excellent'
    elif 6 <= x < 8: return 'good'
    elif 4 <= x < 6: return 'OK'
    else: return 'bad'
# create imdb_cat category column
df1['imdb_category'] = df1['imdb_score'].apply(func)
# then drop imdb_score
df2 = df1.drop(['imdb_score'], axis=1)
df2.head()
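The same binning can be written declaratively with `pd.cut`; note the boundary handling differs slightly from `func` at exactly 4, 6 and 8, since `pd.cut` bins are right-inclusive by default:

```python
import pandas as pd

# declarative version of the <4 / 4-6 / 6-8 / 8-10 scheme above
scores = pd.Series([3.5, 5.0, 7.1, 9.2])
cats = pd.cut(scores, bins=[0, 4, 6, 8, 10],
              labels=["bad", "OK", "good", "excellent"])
print(list(cats))
```

This keeps the bin edges and labels in one place, which is easier to adjust than a chain of if/elif branches.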
# assign columns to X and Y variables
y = df2['imdb_category']
X = df2.drop(['imdb_category','content_rating','director_name','actor_2_name','actor_1_name','movie_title','genres','actor_3_name','plot_keywords','country','language', 'movie_imdb_link','title_year','aspect_ratio'],axis =1)
# split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
X.columns
# build model
# Initialize DecisionTreeClassifier()
dt = DecisionTreeClassifier()
# Train a decision tree model
dt = dt.fit(X_train, y_train)
# model evaluation
# http://scikit-learn.org/stable/modules/model_evaluation.html
print(metrics.accuracy_score(y_test, dt.predict(X_test)))
print("--------------------------------------------------------")
print(metrics.confusion_matrix(y_test, dt.predict(X_test)))
print("--------------------------------------------------------")
print(metrics.classification_report(y_test, dt.predict(X_test)))
68.66% accuracy
# show confusion matrix
skplt.metrics.plot_confusion_matrix(y_true=np.array(y_test), y_pred=dt.predict(X_test))
plt.show()
# remove highly correlated variables: cast_total_facebook_likes and num_user_for_reviews
df1 = df.drop(['cast_total_facebook_likes'], axis=1)
df1 = df1.drop(['num_user_for_reviews'], axis=1)
# Director names, actor names and movie titles are too distinct to be used for predicting IMDB scores,
# so we will remove them before data modeling
df1 = df1.drop(['director_name'], axis=1)
df1 = df1.drop(['actor_1_name'], axis=1)
df1 = df1.drop(['actor_2_name'], axis=1)
df1 = df1.drop(['actor_3_name'], axis=1)
df1 = df1.drop(['movie_title'], axis=1)
df1 = df1.drop(['color'], axis=1)
df1= df1.dropna()
len(df1)
df1.info()
# initialize decision tree algorithm (without fitting)
scores = cross_val_score(DecisionTreeClassifier(), X, y, scoring='accuracy', cv=10)
print(scores)
print(scores.mean())
# https://scikit-learn.org/stable/modules/cross_validation.html
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
y = df2['imdb_category']
X = df2.drop(['imdb_category','content_rating','director_name','actor_2_name','actor_1_name','movie_title','genres','actor_3_name','plot_keywords','country','language', 'movie_imdb_link','title_year','aspect_ratio'],axis =1)
# split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# find the best N
for i in range(1, 16):
    # build model
    knn = KNeighborsClassifier(n_neighbors=i)  # default n_neighbors=5
    # train the KNN model
    knn = knn.fit(X_train, y_train)
    # model evaluation
    print("N =", i, "Accuracy =", metrics.accuracy_score(y_test, knn.predict(X_test)))
The best N is 15, which gave the model the highest accuracy rate, 67.2%.
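KNN is distance-based, so features on huge scales (like `num_voted_users`) dominate the metric; standardizing the features first usually helps. A sketch on synthetic data with one deliberately blown-up feature:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# synthetic data; one feature is scaled up to mimic vote counts vs. scores
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X[:, 0] *= 1e6  # this feature now dominates Euclidean distances

raw = cross_val_score(KNeighborsClassifier(n_neighbors=15), X, y, cv=5).mean()
scaled = cross_val_score(
    make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=15)),
    X, y, cv=5).mean()
print(f"raw: {raw:.3f}  scaled: {scaled:.3f}")
```

Putting the scaler inside the pipeline also guarantees it is fit only on the training folds, avoiding leakage during cross-validation.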
# build model
knn = KNeighborsClassifier(n_neighbors=15)  # best N from the search above
# train the KNN model
knn = knn.fit(X_train, y_train)
# model evaluation
print(metrics.accuracy_score(y_test, knn.predict(X_test)))
print("--------------------------------------------------------")
print(metrics.confusion_matrix(y_test, knn.predict(X_test)))
print("--------------------------------------------------------")
print(metrics.classification_report(y_test, knn.predict(X_test)))
# Show confusion matrix
skplt.metrics.plot_confusion_matrix(y_true=np.array(y_test), y_pred=knn.predict(X_test))
plt.show()
# build model
lr = LogisticRegression(solver='lbfgs', max_iter=500)
lr.fit(X_train, y_train)
# evaluate model
print(metrics.accuracy_score(y_test, lr.predict(X_test)))
print("--------------------------------------------------------")
print(metrics.confusion_matrix(y_test, lr.predict(X_test)))
print("--------------------------------------------------------")
print(metrics.classification_report(y_test, lr.predict(X_test)))
print("--------------------------------------------------------")
65.2% accuracy
from sklearn.ensemble import RandomForestClassifier
# build 20 decision trees
clf = RandomForestClassifier(n_estimators=20)
clf=clf.fit(X_train, y_train)
clf.score(X_test, y_test)
# evaluate model
print(metrics.accuracy_score(y_test, clf.predict(X_test)))
print("--------------------------------------------------------")
print(metrics.confusion_matrix(y_test, clf.predict(X_test)))
print("--------------------------------------------------------")
print(metrics.classification_report(y_test, clf.predict(X_test)))
75.0% accuracy
# show important features
fi=pd.DataFrame({'feature':X.columns, 'importance':clf.feature_importances_})
fi.sort_values('importance', ascending=False).head()
The most important variables respectively are num_voted_users, duration, num_user_for_reviews, budget and num_critic_for_reviews.
# import packages
from sklearn.cluster import KMeans
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import dendrogram, linkage, ward
df1.info()
df3 = df1.drop(['genres','plot_keywords','movie_imdb_link','language','country','content_rating' ], axis=1)
df3.head()
df3 = df3.dropna()
df3.isnull().sum()
len(df3)
# normalize data and save as X
X = (df3 - df3.mean()) / (df3.max() - df3.min())
X.head()
df3.var()
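Note the formula above mixes two schemes: it centers on the mean but divides by the range. The two standard alternatives are min-max scaling (to [0, 1]) and z-score standardization (mean 0, std 1); a sketch of both on toy data:

```python
import pandas as pd

toy = pd.DataFrame({"budget": [1.0, 2.0, 3.0, 4.0],
                    "duration": [90.0, 100.0, 110.0, 120.0]})

# min-max: every column mapped onto [0, 1]
minmax = (toy - toy.min()) / (toy.max() - toy.min())

# z-score: every column to mean 0, std 1
zscore = (toy - toy.mean()) / toy.std()

print(minmax["budget"].tolist())
print(zscore["budget"].round(3).tolist())
```

scikit-learn's `MinMaxScaler` and `StandardScaler` implement the same two transforms; either is a reasonable input for k-means, but picking one consistently makes the cluster centroids easier to interpret.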
# clustering analysis using k-means, k=2
#http://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html
k_means = KMeans(init='k-means++', n_clusters=2, random_state=0)
k_means.fit(X)
#clustering results
k_means.labels_
#cluster centroids or centers
k_means.cluster_centers_
# prepare dataframe for further analysis
# add cluster label into the dataset as a column
df_kmeans = pd.DataFrame(k_means.labels_, columns = ['cluster'])
df_kmeans.head()
# now merge the cluster labels with the data set
# reset the index first so the 0-based cluster labels line up row by row
dfcd = df3.reset_index(drop=True).join(df_kmeans)
dfcd.head()
# total counts per cluster
dfcd.groupby('cluster').size()
# mean value of each cluster
dfcd.groupby('cluster').mean().T
#visualization
sns.lmplot("cluster", "num_voted_users", dfcd, x_jitter=.15, y_jitter=.15)
# setting random seed to get the same results each time.
np.random.seed(1)
# build model
agg= AgglomerativeClustering(n_clusters=2, linkage='ward').fit(X)
agg.labels_
# Visualize dendrogram with p=2
plt.figure(figsize=(16,8))
plt.title('Hierarchical Clustering Dendrogram (truncated)')
plt.xlabel('sample index or (cluster size)')
plt.ylabel('distance')
linkage_matrix = ward(X)
dendrogram(linkage_matrix,
truncate_mode='lastp', # show only the last p merged clusters
p=2, # show only the last p merged clusters
leaf_rotation=90.,
leaf_font_size=12.,
show_contracted=True, # get a distribution impression in truncated branches
orientation="top")
plt.tight_layout() # fixes margins
# prepare dataframe for further analysis
# add cluster label into the dataset as a column
aggdf= pd.DataFrame(agg.labels_, columns = ['cluster'])
aggdf.head(3)
# now merge the cluster labels with the dataset
# reset the index first so the 0-based cluster labels line up row by row
dfc = df3.reset_index(drop=True).join(aggdf)
dfc.tail(3)
# total counts per cluster
dfc.groupby('cluster').size()
# mean value of each cluster
dfc.groupby('cluster').mean().T
#visualization
sns.lmplot("cluster", "num_voted_users", dfc, x_jitter=.15, y_jitter=.15)
We performed clustering analysis with two clusters using both the K-Means and Agglomerative algorithms. Both models generate similar results.
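The similarity claim can be quantified with a silhouette score for each labeling. A sketch on synthetic two-cluster data (the real calls would pass the normalized `X` and each model's labels):

```python
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# synthetic 2-cluster data standing in for the normalized movie features
X, _ = make_blobs(n_samples=200, centers=2, random_state=0)

km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
agg_labels = AgglomerativeClustering(n_clusters=2, linkage="ward").fit_predict(X)

# silhouette ranges from -1 (bad) to 1 (well separated clusters)
km_sil = silhouette_score(X, km_labels)
agg_sil = silhouette_score(X, agg_labels)
print(f"k-means: {km_sil:.3f}  agglomerative: {agg_sil:.3f}")
```

Two methods producing close silhouette scores (and largely matching labels) is stronger evidence of "similar results" than comparing cluster means by eye.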
As one adjunct to its data, IMDb offers a rating scale that allows users to rate films from one to ten. IMDb indicates that submitted ratings are filtered and weighted in various ways to produce the weighted mean displayed for each film, series, and so on.
If you’ve ever picked up a camera, tried to wrangle actors, or sat down to write a script, you know that making movies is not an easy endeavor. There are a lot of moving pieces and continual obstacles to overcome, and even those who’ve been at it a while admit it’s difficult.
Every great movie is filled with obstacles that a hero must overcome to achieve his goal. Sometimes great obstacles, however, don’t just stay on the script page. Instead, they become part of a film’s actual production. Movies are epic endeavors, especially when they're helmed by filmmakers with grand visions. But along with those high standards and incredible goals come all sorts of production nightmares. The movies on this list almost didn't get made thanks to production struggles that brought the process to a halt.
After a script is written or an idea is pitched, a studio needs to agree to put up financing for a movie’s production. It’s hard to imagine now that a classic like Star Wars, which has generated billions of dollars, had trouble getting financing. Many industry insiders even thought it would become the “laughing stock of Hollywood.”
When looking at what kind of movie does best, you want to make sure it is in English and in color. There are a number of directors with an average score over 7.5 (which is what I would consider successful); see above for a list.
When examining whether there is a correlation between different variables, you can see that certain directors are strongly correlated with a high IMDb score. I also examined whether there was a correlation between budget and score, and found that higher-budget movies tended to do better; while there were some outliers, the correlation analysis and charts support this. I also focused on content rating at the beginning and decided it was not nearly as important as I originally thought it would be.
Important factors when looking at how successful a movie will be:
- the duration
- the number of reviews from critics
- the number of reviews from viewers
- the popularity of all actors involved
- the budget of the movie